Enhance loss_compare.py: Add Import/Export Options and Enable CI Comparison with Existing Losses #2063
Conversation
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom):
* #2063
* __->__ #2062

This will prevent errors when later doing `git checkout`.
maybe put in https://github.com/pytorch/torchtitan/tree/main/tests/assets and just call it llama3_losses.txt?
n00b q: Does this ground-truth loss come from a single-GPU run, or is it FSDP only?
```shell
# Verify the accuracy.
echo "Checking FSDP4 v.s. HSDP2FSDP2TP2 accuracy parity"
echo "Checking FSDP4 v.s. HSDP2FSDP4 accuracy parity"
```
I found HSDP2FSDP4 confusing.
Is this FSDP 8 vs. HSDP (4, 2)?
Yes, we can use HSDP (4, 2). I don't think we have a formal way to write HSDP, or do we, lol?
tianyu-l
left a comment
Please move `assets/ci_llama3_losses.txt` to `tests/assets/llama3_losses.txt`.
If we want to extend this to more models, I would suggest creating a folder, e.g. `tests/assets/losses/llama3.txt`.
```shell
python3 scripts/loss_compare.py . . --baseline-options="${baseline_options}" --test-options="${test_options}" --job-dump-folder="${RUNNER_TEMP}/artifacts-to-be-uploaded/accuracy_comparison_outputs" --assert-equal --baseline-ngpus=4 --test-ngpus=8 --steps=1
export test_options="--parallelism.data_parallel_replicate_degree=4"
python3 scripts/loss_compare.py . . --baseline-options="${baseline_options}" --test-options="${test_options}" --job-dump-folder="${RUNNER_TEMP}/artifacts-to-be-uploaded/accuracy_comparison_outputs" --assert-equal --steps=10 --import-result assets/ci_llama3_losses.txt
rm -rf $RUNNER_TEMP/artifacts-to-be-uploaded/*
```
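As a rough illustration of what `--assert-equal` combined with `--import-result` implies, here is a minimal Python sketch that parses a stored loss file and checks a fresh run against it. The line format, function names, and exact-equality semantics below are assumptions for illustration, not the actual `loss_compare.py` implementation.

```python
# Hypothetical sketch: parse a stored loss file and assert that a fresh
# run reproduces it exactly. The line format ("step: N loss: X") is an
# assumption, not necessarily what loss_compare.py exports.

def parse_losses(text: str) -> dict[int, float]:
    """Parse lines like 'step: 1 loss: 8.1234' into {step: loss}."""
    losses = {}
    for line in text.strip().splitlines():
        parts = line.split()
        losses[int(parts[1])] = float(parts[3])
    return losses

def assert_losses_equal(baseline: dict[int, float], test: dict[int, float]) -> None:
    """Fail loudly if any step's loss differs (exact comparison, as --assert-equal suggests)."""
    assert baseline.keys() == test.keys(), "runs logged different steps"
    for step, ref in baseline.items():
        assert test[step] == ref, f"loss diverged at step {step}: {ref} vs {test[step]}"

if __name__ == "__main__":
    stored = "step: 1 loss: 8.1234\nstep: 2 loss: 7.5678"
    fresh = "step: 1 loss: 8.1234\nstep: 2 loss: 7.5678"
    assert_losses_equal(parse_losses(stored), parse_losses(fresh))
    print("losses match")
```

An exact comparison like this is only meaningful when the runs are expected to be bitwise deterministic; a tolerance-based check would be needed otherwise.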
Is this because you dump something to the folder, so that later runs will complain it's not empty? I think we should make this dump folder optional so that the first run doesn't use it at all.
The reason we need `artifacts-to-be-uploaded` is that torchtitan writes to an output folder, which defaults to `outputs`. Creating `outputs` fails because the file system is read-only, so any TorchTitan job in CI has to redirect its outputs to `artifacts-to-be-uploaded`.
I feel it is too much to make `outputs` optional, because there are several checks in the trainer, and all of this is just for CI. I would rather say the integration tests shouldn't expect the folder to be empty.
Stack from ghstack (oldest at bottom):
This PR allows us to check if the loss is consistent across commits/PRs.